Following report is inspired by the article, in which lactic dehydrogenase (LDH), lymphocyte and high-sensitivity C-reactive protein (hs-CRP) were used to build prediction model, which determines if patient (with COVID-19) will survive or not with more than 90% accuracy. In this report, firstly, all biomarkers are visualized with basic statistics. Then, patients distribution (age and gender) is analyzed and visualized with histograms and table that shows also time of hospitalization and number of total blood tests taken. After pre-processing, some of patients are removed from data set, because they didn’t have any test registered, which makes further analysis impossible. After that, all biomarkers summary is presented in the table. Finding which biomarkers are correlated with outcome (dead or survived) helps to take a closer look at the most important columns in the data set and their impact on actual course of illness. So there are three most correlated biomarkers:
After an appropriate data cleaning, there is the prediction model build that reaches accuracy of 99%. Lactate dehydrogenase turned out to be the most important attribute in prediction, then lymphocytes comes and neutrophils at the third place. This makes following report reliable and consistent with the article mentioned above, where LDH is also the most important variable and lymphocyte is at the second place.
df <- read_excel("wuhan_blood_sample_data_Jan_Feb_2020.xlsx")
no_dates_df <- df %>% select(-c('Admission time', 'Discharge time', 'RE_DATE', 'PATIENT_ID', 'age', 'gender' ))
tbl_summary(
no_dates_df,
by = outcome,
missing = "no"
) %>%
add_n() %>%
modify_header(label = "**Biomarker**") %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**Final patient outcome related to the test**") %>%
bold_labels()
| Biomarker | N | Final patient outcome related to the test | |
|---|---|---|---|
| 0, N = 3,2151 | 1, N = 2,9051 | ||
| Hypersensitive cardiac troponinI | 507 | 3 (2, 7) | 70 (18, 631) |
| hemoglobin | 975 | 127 (116, 138) | 123 (110, 135) |
| Serum chloride | 975 | 101 (99, 103) | 104 (100, 111) |
| Prothrombin time | 662 | 13.6 (13.1, 14.1) | 16.3 (15.0, 18.2) |
| procalcitonin | 459 | 0.04 (0.02, 0.06) | 0.38 (0.14, 1.13) |
| eosinophils(%) | 957 | 0.70 (0.00, 1.80) | 0.00 (0.00, 0.10) |
| Interleukin 2 receptor | 268 | 529 (400, 742) | 1,180 (807, 1,603) |
| Alkaline phosphatase | 930 | 60 (50, 75) | 83 (64, 123) |
| albumin | 934 | 36 (34, 39) | 28 (24, 31) |
| basophil(%) | 957 | 0.20 (0.10, 0.40) | 0.10 (0.10, 0.20) |
| Interleukin 10 | 267 | 5 (5, 8) | 11 (6, 17) |
| Total bilirubin | 930 | 8 (6, 12) | 14 (10, 25) |
| Platelet count | 957 | 229 (176, 290) | 112 (55, 174) |
| monocytes(%) | 958 | 8.2 (6.3, 10.0) | 3.0 (2.0, 4.7) |
| antithrombin | 330 | 93 (86, 103) | 80 (70, 92) |
| Interleukin 8 | 268 | 11 (7, 19) | 30 (18, 61) |
| indirect bilirubin | 906 | 4.9 (3.4, 7.1) | 6.2 (4.2, 9.2) |
| Red blood cell distribution width | 923 | 12.20 (11.80, 12.80) | 13.20 (12.40, 14.40) |
| neutrophils(%) | 957 | 66 (56, 76) | 92 (88, 95) |
| total protein | 931 | 68 (65, 72) | 62 (57, 68) |
| Quantification of Treponema pallidum antibodies | 279 | 0.05 (0.04, 0.07) | 0.06 (0.04, 0.07) |
| Prothrombin activity | 659 | 94 (88, 103) | 66 (56, 78) |
| HBsAg | 279 | 0.00 (0.00, 0.01) | 0.01 (0.00, 0.02) |
| mean corpuscular volume | 957 | 89.8 (86.8, 91.9) | 91.3 (87.1, 96.4) |
| hematocrit | 957 | 37.1 (34.3, 39.9) | 35.9 (32.5, 39.8) |
| White blood cell count | 1,127 | 6 (4, 8) | 12 (8, 17) |
| Tumor necrosis factorα | 268 | 8 (6, 10) | 11 (8, 17) |
| mean corpuscular hemoglobin concentration | 957 | 343 (335, 350) | 342 (331, 350) |
| fibrinogen | 566 | 4.40 (3.56, 5.34) | 3.92 (2.44, 5.63) |
| Interleukin 1β | 268 | 5.0 (5.0, 5.0) | 5.0 (5.0, 5.0) |
| Urea | 936 | 4 (3, 5) | 11 (7, 17) |
| lymphocyte count | 957 | 1.25 (0.87, 1.62) | 0.46 (0.31, 0.69) |
| PH value | 384 | 6.50 (6.00, 7.00) | 6.50 (6.00, 7.41) |
| Red blood cell count | 1,127 | 4.2 (3.8, 4.7) | 4.0 (3.6, 4.6) |
| Eosinophil count | 957 | 0.03 (0.00, 0.09) | 0.00 (0.00, 0.01) |
| Corrected calcium | 914 | 2.37 (2.27, 2.44) | 2.35 (2.27, 2.44) |
| Serum potassium | 980 | 4.28 (3.92, 4.62) | 4.60 (4.04, 5.27) |
| glucose | 775 | 5.7 (5.0, 7.6) | 9.1 (6.9, 13.3) |
| neutrophils count | 957 | 3.5 (2.4, 5.2) | 10.8 (7.0, 15.2) |
| Direct bilirubin | 930 | 4 (2, 5) | 8 (5, 14) |
| Mean platelet volume | 862 | 10.40 (9.90, 11.00) | 11.30 (10.70, 12.20) |
| ferritin | 283 | 504 (235, 834) | 1,636 (928, 2,517) |
| RBC distribution width SD | 923 | 39.5 (37.6, 41.4) | 43.7 (39.9, 48.5) |
| Thrombin time | 566 | 16.40 (15.60, 17.30) | 17.30 (15.80, 19.75) |
| (%)lymphocyte | 958 | 24 (16, 33) | 4 (2, 7) |
| HCV antibody quantification | 279 | 0.06 (0.04, 0.08) | 0.07 (0.04, 0.11) |
| D-D dimer | 630 | 1 (0, 1) | 19 (3, 21) |
| Total cholesterol | 931 | 3.93 (3.39, 4.48) | 3.32 (2.72, 3.88) |
| aspartate aminotransferase | 935 | 21 (17, 29) | 38 (25, 59) |
| Uric acid | 934 | 240 (193, 304) | 245 (166, 374) |
| HCO3- | 934 | 24.7 (22.8, 26.7) | 21.8 (18.8, 24.7) |
| calcium | 979 | 2.17 (2.10, 2.25) | 2.00 (1.90, 2.08) |
| Amino-terminal brain natriuretic peptide precursor(NT-proBNP) | 475 | 64 (23, 166) | 1,467 (516, 4,578) |
| Lactate dehydrogenase | 934 | 220 (189, 278) | 593 (431, 840) |
| platelet large cell ratio | 862 | 28 (23, 33) | 35 (30, 42) |
| Interleukin 6 | 272 | 8 (2, 21) | 66 (30, 142) |
| Fibrin degradation products | 330 | 4 (4, 4) | 114 (18, 150) |
| monocytes count | 957 | 0.43 (0.32, 0.58) | 0.36 (0.20, 0.58) |
| PLT distribution width | 862 | 11.70 (10.70, 13.00) | 13.60 (12.10, 15.93) |
| globulin | 930 | 31.8 (29.5, 35.2) | 34.1 (30.2, 38.2) |
| γ-glutamyl transpeptidase | 930 | 29 (19, 46) | 42 (27, 79) |
| International standard ratio | 659 | 1.04 (0.99, 1.09) | 1.31 (1.17, 1.48) |
| basophil count(#) | 957 | 0.010 (0.010, 0.020) | 0.010 (0.010, 0.030) |
| 2019-nCoV nucleic acid detection | 501 | ||
| -1 | 444 (100%) | 57 (100%) | |
| mean corpuscular hemoglobin | 957 | 30.70 (29.60, 31.90) | 31.20 (29.90, 32.70) |
| Activation of partial thromboplastin time | 568 | 39 (35, 43) | 40 (36, 45) |
| High sensitivity C-reactive protein | 737 | 7 (2, 35) | 114 (65, 191) |
| HIV antibody quantification | 278 | 0.09 (0.08, 0.11) | 0.08 (0.07, 0.11) |
| serum sodium | 975 | 140 (138, 141) | 142 (138, 148) |
| thrombocytocrit | 862 | 0.24 (0.19, 0.30) | 0.15 (0.10, 0.21) |
| ESR | 383 | 26 (13, 40) | 36 (16, 59) |
| glutamic-pyruvic transaminase | 931 | 21 (15, 36) | 26 (18, 44) |
| eGFR | 936 | 100 (85, 114) | 72 (43, 91) |
| creatinine | 936 | 64 (54, 83) | 88 (68, 130) |
|
1
Statistics presented: Median (IQR); n (%)
|
|||
Data after cleaning:
patients_df <- df %>% group_by(`Admission time`, `Discharge time`, gender, age, outcome) %>%
summarise(PATIENT_ID = sum(PATIENT_ID, na.rm = TRUE), `Total records` = n())
patients_df <- patients_df %>%
mutate(`Days in hospital` = ceiling(difftime(`Discharge time`, `Admission time`, units = "days")))
df <- full_join(patients_df %>% ungroup() %>% select(`Admission time`, PATIENT_ID), df %>% select(-PATIENT_ID), by="Admission time")
df$gender<-ifelse(df$gender==1, 'Male', 'Female')
df <- df %>% mutate(gender = as.factor(gender))
patients_df$gender<-ifelse(patients_df$gender==1, 'Male', 'Female')
patients_df <- patients_df %>% mutate(gender = as.factor(gender))
df$outcome<-ifelse(df$outcome==1, 'Death', 'Survival')
df <- df %>% mutate(outcome = as.factor(outcome))
patients_df$outcome<-ifelse(patients_df$outcome==1, 'Death', 'Survival')
patients_df <- patients_df %>% mutate(outcome = as.factor(outcome))
patients_df <- df %>% group_by(PATIENT_ID, outcome, `Admission time`, `Discharge time`, gender, age) %>% summarise(missing_test=is.na(RE_DATE)) %>%
group_by(PATIENT_ID, outcome, `Admission time`, `Discharge time`, gender, age) %>%
summarise(test_provided = ifelse(missing_test, 0, 1)) %>%
group_by(PATIENT_ID, outcome, `Admission time`, `Discharge time`, gender, age) %>%
summarise(`Total blood tests`=sum(test_provided)) %>%
mutate(`Days in hospital` = ceiling(difftime(`Discharge time`, `Admission time`, units = "days")))
last_test_df <- df %>%
group_by(PATIENT_ID, outcome, gender, age) %>%
fill_(names(df)) %>%
fill_(names(df), "up") %>%
summarise_at(vars(`Hypersensitive cardiac troponinI`:creatinine), function(x) last(x,order_by = is.na(x)))
cleaned_df <- last_test_df %>% ungroup() %>% select(PATIENT_ID, outcome, age, gender, hemoglobin, `eosinophils(%)`, `Alkaline phosphatase`, albumin, `basophil(%)`, `Total bilirubin`, `Platelet count`, `monocytes(%)`, `neutrophils(%)`, `total protein`, `mean corpuscular volume`, hematocrit, `White blood cell count`, `mean corpuscular hemoglobin concentration`, Urea, `lymphocyte count`, `Red blood cell count`, `Eosinophil count`, `neutrophils count`, `Direct bilirubin`, `(%)lymphocyte`, `Total cholesterol`, `aspartate aminotransferase`, `Uric acid`, `HCO3-`, `Lactate dehydrogenase`, `monocytes count`, globulin, `γ-glutamyl transpeptidase`, `basophil count(#)`, `mean corpuscular hemoglobin`, `glutamic-pyruvic transaminase`, eGFR, creatinine) %>% filter_all(function(x) !is.na(x))
tbl_summary(
cleaned_df %>% ungroup() %>% select(-PATIENT_ID),
by = outcome
) %>%
add_n() %>%
modify_header(label = "") %>%
add_overall() %>%
bold_labels()
| Overall, N = 3541 | N | Death, N = 1621 | Survival, N = 1921 | |
|---|---|---|---|---|
| age | 62 (46, 70) | 354 | 69 (62, 77) | 51 (37, 62) |
| gender | 354 | |||
| Female | 147 (42%) | 45 (28%) | 102 (53%) | |
| Male | 207 (58%) | 117 (72%) | 90 (47%) | |
| hemoglobin | 125 (112, 138) | 354 | 120 (108, 138) | 127 (117, 138) |
| eosinophils(%) | 0.20 (0.00, 1.50) | 354 | 0.00 (0.00, 0.10) | 1.30 (0.60, 2.10) |
| Alkaline phosphatase | 72 (54, 98) | 354 | 94 (68, 131) | 60 (50, 75) |
| albumin | 33.2 (28.2, 37.6) | 354 | 27.6 (24.1, 31.1) | 37.1 (34.1, 39.3) |
| basophil(%) | 0.20 (0.10, 0.40) | 354 | 0.10 (0.10, 0.20) | 0.30 (0.20, 0.50) |
| Total bilirubin | 11 (7, 16) | 354 | 13 (10, 22) | 8 (6, 12) |
| Platelet count | 189 (113, 257) | 354 | 109 (54, 168) | 242 (196, 299) |
| monocytes(%) | 6.2 (2.9, 8.9) | 354 | 2.8 (2.0, 4.6) | 8.4 (7.0, 10.1) |
| neutrophils(%) | 78 (62, 92) | 354 | 93 (88, 95) | 64 (55, 71) |
| total protein | 66 (61, 70) | 354 | 62 (57, 68) | 68 (65, 71) |
| mean corpuscular volume | 90.5 (87.0, 94.3) | 354 | 91.2 (87.1, 96.2) | 90.1 (87.0, 92.7) |
| hematocrit | 36.3 (33.0, 40.2) | 354 | 35.3 (32.1, 40.6) | 37.0 (34.1, 40.1) |
| White blood cell count | 8 (5, 13) | 354 | 12 (8, 17) | 6 (4, 8) |
| mean corpuscular hemoglobin concentration | 342 (332, 349) | 354 | 342 (328, 350) | 342 (334, 349) |
| Urea | 5 (4, 11) | 354 | 12 (7, 20) | 4 (3, 5) |
| lymphocyte count | 0.98 (0.52, 1.54) | 354 | 0.50 (0.30, 0.72) | 1.47 (1.11, 1.81) |
| Red blood cell count | 4.10 (3.55, 4.65) | 354 | 3.96 (3.51, 4.62) | 4.16 (3.60, 4.65) |
| Eosinophil count | 0.02 (0.00, 0.09) | 354 | 0.00 (0.00, 0.01) | 0.08 (0.03, 0.12) |
| neutrophils count | 5 (3, 11) | 354 | 12 (8, 16) | 3 (3, 5) |
| Direct bilirubin | 5 (3, 7) | 354 | 7 (5, 12) | 3 (2, 5) |
| (%)lymphocyte | 14 (4, 28) | 354 | 4 (2, 7) | 26 (19, 33) |
| Total cholesterol | 3.72 (2.95, 4.37) | 354 | 3.13 (2.58, 3.65) | 4.25 (3.65, 4.71) |
| aspartate aminotransferase | 25 (19, 41) | 354 | 40 (28, 66) | 20 (16, 25) |
| Uric acid | 260 (198, 346) | 354 | 258 (188, 391) | 260 (204, 326) |
| HCO3- | 23.9 (20.9, 26.4) | 354 | 21.0 (17.5, 23.6) | 25.6 (23.8, 27.4) |
| Lactate dehydrogenase | 274 (197, 615) | 354 | 652 (471, 889) | 202 (177, 240) |
| monocytes count | 0.43 (0.31, 0.61) | 354 | 0.38 (0.20, 0.60) | 0.47 (0.37, 0.62) |
| globulin | 32.4 (28.9, 35.7) | 354 | 34.1 (30.7, 38.0) | 30.9 (28.1, 33.4) |
| γ-glutamyl transpeptidase | 33 (21, 55) | 354 | 42 (27, 75) | 28 (18, 44) |
| basophil count(#) | 0.020 (0.010, 0.030) | 354 | 0.010 (0.010, 0.030) | 0.020 (0.010, 0.030) |
| mean corpuscular hemoglobin | 30.90 (29.70, 32.20) | 354 | 31.20 (29.92, 32.48) | 30.80 (29.60, 32.12) |
| glutamic-pyruvic transaminase | 26 (17, 42) | 354 | 29 (18, 47) | 24 (16, 38) |
| eGFR | 90 (67, 105) | 354 | 68 (35, 92) | 99 (86, 112) |
| creatinine | 74 (58, 97) | 354 | 94 (65, 151) | 65 (55, 83) |
|
1
Statistics presented: Median (IQR); n (%)
|
||||
It’s good to note, that there are patients, who didn’t have any test taken.
patients_df %>% filter(`Total blood tests`==0)
| PATIENT_ID | outcome | Admission time | Discharge time | gender | age | Total blood tests | Days in hospital |
|---|---|---|---|---|---|---|---|
| 187 | Survival | 2020-02-17 18:56:09 | 2020-02-20 20:55:31 | Male | 44 | 0 | 4 days |
| 189 | Survival | 2020-02-10 04:37:30 | 2020-02-10 13:54:23 | Male | 61 | 0 | 1 days |
| 192 | Survival | 2020-02-16 17:14:30 | 2020-02-16 21:05:17 | Male | 34 | 0 | 1 days |
| 197 | Survival | 2020-02-10 05:01:15 | 2020-02-11 15:40:41 | Male | 67 | 0 | 2 days |
| 200 | Survival | 2020-02-16 04:41:21 | 2020-02-16 15:26:13 | Male | 25 | 0 | 1 days |
| 201 | Survival | 2020-02-17 21:30:07 | 2020-02-20 13:05:11 | Male | 39 | 0 | 3 days |
| 253 | Death | 2020-02-13 21:05:54 | 2020-02-14 11:00:05 | Male | 51 | 0 | 1 days |
| 268 | Death | 2020-02-14 11:46:36 | 2020-02-15 10:15:28 | Male | 69 | 0 | 1 days |
| 285 | Death | 2020-01-31 23:20:40 | 2020-02-01 03:16:34 | Male | 63 | 0 | 1 days |
| 289 | Death | 2020-02-01 02:12:05 | 2020-02-01 10:54:57 | Male | 63 | 0 | 1 days |
| 311 | Death | 2020-02-11 23:45:15 | 2020-02-15 09:02:41 | Female | 77 | 0 | 4 days |
| 347 | Death | 2020-02-11 22:25:20 | 2020-02-15 10:03:32 | Female | 80 | 0 | 4 days |
| 354 | Death | 2020-02-03 21:22:41 | 2020-02-04 01:03:11 | Male | 57 | 0 | 1 days |
| 359 | Death | 2020-02-11 01:42:48 | 2020-02-14 09:38:13 | Male | 65 | 0 | 4 days |
patients_summary <- patients_df %>% ungroup() %>% select(-c(age, PATIENT_ID))
tbl_summary(
patients_summary,
by = outcome,
label = gender ~ "Gender",
) %>%
add_n() %>%
modify_header(label = "") %>%
add_overall() %>%
bold_labels()
| Overall, N = 3751 | N | Death, N = 1741 | Survival, N = 2011 | |
|---|---|---|---|---|
| Gender | 375 | |||
| Female | 151 (40%) | 48 (28%) | 103 (51%) | |
| Male | 224 (60%) | 126 (72%) | 98 (49%) | |
| Total blood tests | 16 (9, 21) | 375 | 14 (7, 24) | 16 (12, 20) |
| Days in hospital | 10 (5, 16) | 375 | 6 (3, 10) | 14 (10, 18) |
|
1
Statistics presented: n (%); Median (IQR)
|
||||
ggplot(patients_df, aes(x=age,fill=gender)) + geom_histogram(binwidth = 1) + facet_grid(. ~ gender) + scale_x_continuous(name="Age", limits=c(min(df$age), max(df$age)), breaks = seq(0, 100, by=10)) + scale_y_continuous(name = "Number of patients", limits = c(0,10), breaks = seq(0,10, by=1)) +
theme_minimal()
ggplot(patients_df, aes(x=age,fill=outcome)) + geom_histogram(binwidth = 1) + facet_grid(outcome ~ gender) + scale_x_continuous(name="Age", limits=c(min(df$age), max(df$age)), breaks = seq(0, 100, by=10)) + scale_y_continuous(name = "Number of patients", limits = c(0,9), breaks = seq(0,9, by=1)) +
theme_minimal()
ggplot(patients_df, aes(x=`Days in hospital`, fill=outcome)) + geom_histogram(binwidth = 1) + facet_grid(. ~ outcome) + ylab("Number of patients") +
theme_minimal()
ggplot(patients_df, aes(x=`Total blood tests`, fill=outcome)) + geom_histogram(binwidth = 1) + facet_grid(. ~ outcome) + ylab("Number of patients") +
scale_x_continuous(name="Total blood tests", limits=c(0, 60), breaks = seq(0, 60, by=10)) +
scale_y_continuous(name = "Number of patients", limits = c(0,20), breaks = seq(0,20, by=2)) +
theme_minimal()
Finding correlation between outcome and other variables.
cor_df <-cleaned_df
cor_df$outcome<-ifelse(cor_df$outcome=='Death', 1, 0)
cor_df$gender<-ifelse(cor_df$gender=='Male', 1, 0)
cor_df <- cor_df %>% rename(isMale=gender)
cor_df <- cor_df %>% rename(Death=outcome)
cor_df <- cor_df %>% select (-PATIENT_ID)
corrMatrix <- cor(cor_df[sapply(cor_df, is.numeric)], use='pairwise.complete.obs')
correlation_df <- as.data.frame(corrMatrix)
correlation_df %>% rownames_to_column('variable') %>% filter(variable != 'Death') %>% select(variable, Death) %>% mutate(Death = abs(Death)) %>%
arrange(desc(Death)) %>%
rename(`Outcome correlation` = Death) %>%
head(10)
| variable | Outcome correlation |
|---|---|
| (%)lymphocyte | 0.7608148 |
| neutrophils(%) | 0.7594852 |
| albumin | 0.7164566 |
| Lactate dehydrogenase | 0.6892054 |
| neutrophils count | 0.6300651 |
| Platelet count | 0.5817254 |
| age | 0.5569920 |
| eosinophils(%) | 0.5511621 |
| HCO3- | 0.5371294 |
| monocytes(%) | 0.5097846 |
In following section, we will use 3 biomarkers that are most correlated to outcome and visualize theirs mean values among each patient at 3D graph. So we take only these patients who had all of these 3 biomarkers tested at least once. If somebody was tested more than once, then the last value will be taken.
corr_visualize_df <- cleaned_df
mycolors <- c('royalblue1', 'darkcyan')
corr_visualize_df$color <- mycolors[ as.numeric(corr_visualize_df$outcome) ]
par(mar=c(0,0,0,0))
plot3d(
x=corr_visualize_df$`neutrophils(%)`, y=corr_visualize_df$`(%)lymphocyte`, z=corr_visualize_df$albumin,
col = corr_visualize_df$color,
type = 's',
radius = 1,
legend=TRUE,
xlab="Neutrophils(%)", ylab="Lymphocyte(%)", zlab="Albumin")
legend3d("topright", legend = c('Death', 'Survival'), pch = 10, col = mycolors, cex=0.8, inset=c(0.02))
writeWebGL( filename="3d_correlation_mean.html" , width=600, height=600)
htmltools::includeHTML("./3d_correlation_mean.html")
You must enable Javascript to view this page properly.
In following section there are 3 animations that show how most common biomarkers’ tests results (which means that these biomarkers were tested most often among all the patients) varied during hospitalization.
biomarker_popularity <- df %>% group_by(PATIENT_ID, outcome) %>% summarise_at(vars(`Hypersensitive cardiac troponinI`:creatinine), function(x) sum(!is.na(x)))
biomarker_popularity$sum_of_single_test <- rowSums(biomarker_popularity %>% ungroup() %>% select(-PATIENT_ID, -outcome))
most_tested_death <- biomarker_popularity %>% select(sum_of_single_test, outcome, `neutrophils(%)`, `(%)lymphocyte`, albumin) %>% filter(outcome=='Death') %>% arrange(desc(sum_of_single_test)) %>% head(10)
most_tested_survival <- biomarker_popularity %>% select(sum_of_single_test, outcome, `neutrophils(%)`, `(%)lymphocyte`, albumin) %>% filter(outcome=='Survival') %>% arrange(desc(sum_of_single_test)) %>% head(10)
foo <- rbind(most_tested_death, most_tested_survival)
patients_tests <- merge(foo %>% ungroup() %>% select(PATIENT_ID), df %>% ungroup() %>% select(PATIENT_ID, RE_DATE,`neutrophils(%)`, `(%)lymphocyte`, albumin, outcome), by="PATIENT_ID") %>% filter(!is.na(`neutrophils(%)`) | !is.na(`(%)lymphocyte`) | !is.na(albumin))
lymphocyte_seq <- patients_tests %>% filter(!is.na(`(%)lymphocyte`))
lymphocyte_seq %>% mutate(PATIENT_ID=as.factor(PATIENT_ID)) %>%
ggplot( aes(x=RE_DATE, y=`(%)lymphocyte`, group=PATIENT_ID, color=PATIENT_ID)) +
geom_line() +
geom_point() +
facet_grid(rows=vars(outcome)) +
ggtitle("Lymphocyte (%) during patient hospitalization") +
theme_ipsum() +
ylab("Lymphocyte (%)") +
transition_reveal(RE_DATE)
#anim_save("lymphocyte_seq.gif")
neutrophils_seq <- patients_tests %>% filter(!is.na(`neutrophils(%)`))
neutrophils_seq %>% mutate(PATIENT_ID=as.factor(PATIENT_ID)) %>%
ggplot( aes(x=RE_DATE, y=`neutrophils(%)`, group=PATIENT_ID, color=PATIENT_ID)) +
geom_line() +
geom_point() +
facet_grid(rows=vars(outcome)) +
ggtitle("Neutrophils (%) during patient hospitalization") +
theme_ipsum() +
ylab("Neutrophils (%)") +
transition_reveal(RE_DATE)
#anim_save("neutrophils_seq.gif")
albumin_seq <- patients_tests %>% filter(!is.na(albumin))
albumin_seq %>% mutate(PATIENT_ID=as.factor(PATIENT_ID)) %>%
ggplot( aes(x=RE_DATE, y=albumin, group=PATIENT_ID, color=PATIENT_ID)) +
geom_line() +
geom_point() +
facet_grid(rows=vars(outcome)) +
ggtitle("Albumin value during patient hospitalization") +
theme_ipsum() +
ylab("Albumin") +
transition_reveal(RE_DATE)
#anim_save("albumin_seq.gif")
In order to build an appropriate model, there is a need for the data cleaning. There shouldn’t be any NA in the data set. Moreover, every considered patient should have all of considered biomarkers tested at least once. For building the model purpose, the last existing test of each biomarker for each patient was taken. As it was shown before, many biomarkers are unsuitable because of containing too many NA in theirs columns. Which means that they should be removed from the data set. Below is the list of remaining columns:
With condition of existing at least one of listed above biomarkers tests, 21 patients should be ignored in building model process.
cleaned_df <- cleaned_df %>% select(-PATIENT_ID)
set.seed(23)
inTraining <-
createDataPartition(
y = cleaned_df$outcome,
p = .70,
list = FALSE)
training <- cleaned_df[ inTraining,]
testing <- cleaned_df[-inTraining,]
rfGrid <- expand.grid(mtry = 10:30)
ctrl <- trainControl(
method = "repeatedcv",
classProbs = TRUE,
number = 2,
repeats = 5)
set.seed(23)
fit <- train(outcome ~ .,
data = training,
method = "rf",
metric = "ROC",
preProc = c("center", "scale"),
trControl = ctrl,
tuneGrid = rfGrid,
ntree = 30)
rfClasses <- predict(fit, newdata = testing)
confusionMatrix(data = rfClasses, testing$outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Death Survival
## Death 48 1
## Survival 0 56
##
## Accuracy : 0.9905
## 95% CI : (0.9481, 0.9998)
## No Information Rate : 0.5429
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9808
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9825
## Pos Pred Value : 0.9796
## Neg Pred Value : 1.0000
## Prevalence : 0.4571
## Detection Rate : 0.4571
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.9912
##
## 'Positive' Class : Death
##
99% of accuracy means that the model works very well. That gives only one false positive, which is definitely better than a false negative. There is no need to improve the model.
varImp(fit)
## rf variable importance
##
## only 20 most important variables shown (out of 36)
##
## Overall
## `Lactate dehydrogenase` 100.000
## `(%)lymphocyte` 56.749
## `neutrophils(%)` 31.971
## `Eosinophil count` 17.594
## albumin 15.974
## `neutrophils count` 14.715
## `monocytes(%)` 14.162
## `Platelet count` 13.067
## `lymphocyte count` 10.040
## `aspartate aminotransferase` 9.925
## `eosinophils(%)` 6.399
## age 3.518
## `Direct bilirubin` 3.514
## `Total cholesterol` 3.081
## `HCO3-` 2.540
## Urea 2.374
## eGFR 2.171
## `γ-glutamyl transpeptidase` 1.967
## `Alkaline phosphatase` 1.607
## `total protein` 1.047